NSF PAR Search | NSF Public Access Repository

Data-Efficient Policy Evaluation Through Behavior Policy Search

Hanna, Josiah P; Chandak, Yash; Thomas, Philip S; White, Martha; Stone, Peter; Niekum, Scott (October 2024, Journal of machine learning research)
Ravikumar, Pradeep (Ed.)
We consider the task of evaluating a policy for a Markov decision process (MDP). The standard unbiased technique for evaluating a policy is to deploy the policy and observe its performance. We show that the data collected from deploying a different policy, commonly called the behavior policy, can be used to produce unbiased estimates with lower mean squared error than this standard technique. We derive an analytic expression for a minimal variance behavior policy -- a behavior policy that minimizes the mean squared error of the resulting estimates. Because this expression depends on terms that are unknown in practice, we propose a novel policy evaluation sub-problem, behavior policy search: searching for a behavior policy that reduces mean squared error. We present two behavior policy search algorithms and empirically demonstrate their effectiveness in lowering the mean squared error of policy performance estimates.
Full Text Available
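
The claim in the abstract above, that data from a different behavior policy can give unbiased estimates with lower mean squared error, can be illustrated on a toy problem. The sketch below is not the paper's algorithm. It assumes a hypothetical one-step problem with two actions and deterministic, positive rewards, and it uses the standard importance sampling result that a sampling distribution proportional to (target probability × |return|) gives a zero-variance estimate in that setting. All names and numbers are illustrative.

```python
from fractions import Fraction as F

def is_mean_and_variance(target, behavior, rewards):
    """Exact mean and variance of a one-sample importance sampling estimate.

    The estimate for a sampled action a is (target[a] / behavior[a]) * rewards[a],
    where a is drawn from the behavior policy.
    """
    estimates = [(p / b) * r for p, b, r in zip(target, behavior, rewards)]
    mean = sum(b * e for b, e in zip(behavior, estimates))
    second_moment = sum(b * e ** 2 for b, e in zip(behavior, estimates))
    return mean, second_moment - mean ** 2

# Hypothetical target policy and rewards (assumed for illustration).
target = [F(9, 10), F(1, 10)]
rewards = [F(1), F(10)]
true_value = sum(p * r for p, r in zip(target, rewards))

# Standard technique: deploy the target policy itself (behavior == target).
on_policy = is_mean_and_variance(target, target, rewards)

# Behavior policy proportional to target probability times |reward|.
weights = [p * abs(r) for p, r in zip(target, rewards)]
tilted = [w / sum(weights) for w in weights]
off_policy = is_mean_and_variance(target, tilted, rewards)

print("true value:", true_value)
print("on-policy  mean, variance:", on_policy)
print("off-policy mean, variance:", off_policy)
```

Both estimators have the same mean (they are unbiased), but the tilted behavior policy has lower variance here. In practice the rewards are unknown in advance, which is why the abstract frames finding such a policy as a search problem.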
High-Confidence Off-Policy (or Counterfactual) Variance Estimation

Chandak, Yash; Shankar, Shiv; Thomas, Philip (April 2021, Proceedings of the AAAI Conference on Artificial Intelligence)
Full Text Available
High Confidence Generalization for Reinforcement Learning

Kostas, James; Chandak, Yash; Jordan, Scott; Theocharous, Georgios; Thomas, Philip (July 2021, Proceedings of Machine Learning Research)
Full Text Available
Universal Off-Policy Evaluation

Chandak, Yash; Niekum, Scott; Castro da Silva, Bruno; Learned-Miller, Erik; Brunskill, Emma; Thomas, Philip (December 2021, Advances in neural information processing systems)

When faced with sequential decision-making problems, it is often useful to be able to predict what would happen if decisions were made using a new policy. Those predictions must often be based on data collected under some previously used decision-making rule. Many previous methods enable such off-policy (or counterfactual) estimation of the expected value of a performance measure called the return. In this paper, we take the first steps towards a universal off-policy estimator (UnO)—one that provides off-policy estimates and high-confidence bounds for any parameter of the return distribution. We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns. Finally, we also discuss UnO’s applicability in various settings, including fully observable, partially observable (i.e., with unobserved confounders), Markovian, non-Markovian, stationary, smoothly non-stationary, and discrete distribution shifts.
Full Text Available
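
To make concrete the idea in the abstract above of estimating the whole return distribution off-policy, the following is a minimal sketch of an importance-weighted empirical CDF, from which a mean and a quantile are then read off. It is an assumption-laden illustration, not necessarily the estimator defined in the paper, and it omits the high-confidence bounds entirely. The returns and importance ratios are made-up numbers, and the function names are hypothetical.

```python
def weighted_cdf(returns, ratios, v):
    """Importance-weighted estimate of P(return <= v) under the new policy.

    returns[i] is the return of trajectory i collected under the old policy;
    ratios[i] is its importance ratio (new-policy probability of the
    trajectory divided by its old-policy probability).
    """
    n = len(returns)
    return sum(rho for g, rho in zip(returns, ratios) if g <= v) / n

def weighted_mean(returns, ratios):
    """Importance-weighted estimate of the expected return."""
    return sum(rho * g for g, rho in zip(returns, ratios)) / len(returns)

def weighted_quantile(returns, ratios, alpha):
    """Smallest observed return whose estimated CDF value reaches alpha."""
    for v in sorted(set(returns)):
        if weighted_cdf(returns, ratios, v) >= alpha:
            return v
    return max(returns)

# Made-up data for illustration only.
returns = [0.0, 1.0, 2.0, 4.0]
ratios = [0.5, 1.5, 1.0, 1.0]

print("CDF at 1.0:", weighted_cdf(returns, ratios, 1.0))
print("mean:", weighted_mean(returns, ratios))
print("median:", weighted_quantile(returns, ratios, 0.5))
```

Other parameters of the return distribution, such as the variance or CVaR, could be computed from the same estimated CDF in a similar way.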
Towards Safe Policy Improvement for Non-Stationary MDPs

Chandak, Yash; Jordan, Scott; Theocharous, Georgios; White, Martha; Thomas, Philip (January 2020, Advances in neural information processing systems)
Full Text Available